Hugging Face Transformers Template

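Chat is the most popular application of LLMs. In a chat setting, the model input is not a plain string but a list of one or more messages, each with a role and a content field. This is where the concept of a chat input template comes from.

Hugging Face Transformers supports this concept through its Template feature. The chat template is part of the tokenizer.

With the Template feature in Transformers, we can follow a consistent template format when we continue training a pretrained model.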

Warning

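If the template used for incremental training or fine-tuning does not match the template the model was originally trained with, the model's capabilities will be damaged!

This article is a set of reading notes on the "Templates for Chat Models" documentation, plus some of my own hands-on experience. For the full feature description, see the documentation.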

Examples

Templates are easy to use.

To apply a template to a chat conversation, build a list of messages with role and content keys, then pass it to the apply_chat_template() method.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

chat = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
   {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)

# Result:
# " Hello, how are you?  I'm doing great. How can I help you today?   I'd like to show off how chat templating works!</s>"

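This template is quite simple: it pads the message contents with spaces and concatenates them into a single string. Note, however, that it appends </s> at the end to mark the end of the content.

Switching to the Mistral-7B-Instruct-v0.1 model: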
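To make the spacing rule concrete, here is a minimal pure-Python sketch that reproduces the string shown above. The function name format_blenderbot_style and its rules (a leading space before user turns, two spaces between turns, the end-of-sequence token last) are assumptions inferred from that output. This is not the tokenizer's actual template code.

```python
# Hand-written approximation of the formatting seen in the output above.
# The function name and the spacing rules are assumptions inferred from that
# output; the real template lives inside the tokenizer.
def format_blenderbot_style(messages, eos_token="</s>"):
    parts = []
    for index, message in enumerate(messages):
        if message["role"] == "user":
            parts.append(" ")  # user turns get one leading space
        parts.append(message["content"])
        if index < len(messages) - 1:
            parts.append("  ")  # two spaces separate consecutive turns
    parts.append(eos_token)  # end-of-sequence marker closes the string
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

blenderbot_text = format_blenderbot_style(chat)
print(repr(blenderbot_text))
```

In practice you would still call apply_chat_template(), since the tokenizer ships the authoritative template. The sketch only shows how little logic this particular format needs.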

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
# Result:
# "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

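This template is more complex: it adds [INST] and [/INST] control tokens that mark where each user message starts and ends.

As this shows, the tokenizers of different models use different template formats.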
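For comparison, the Mistral-style output above can be approximated with a few lines of plain Python. The function name format_mistral_style and its rules are assumptions inferred from the output shown, not the tokenizer's actual template. It handles only the user and assistant roles.

```python
# Hand-written approximation of the Mistral-style output shown above.
# The function name and rules are assumptions inferred from that output.
def format_mistral_style(messages, bos_token="<s>", eos_token="</s>"):
    text = bos_token  # the string opens with the beginning-of-sequence token
    for message in messages:
        if message["role"] == "user":
            # user turns are wrapped in [INST] ... [/INST]
            text += "[INST] " + message["content"] + " [/INST]"
        elif message["role"] == "assistant":
            # assistant turns are followed by the end-of-sequence token and a space
            text += message["content"] + eos_token + " "
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return text


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

mistral_text = format_mistral_style(chat)
print(mistral_text)
```

Both sketches take the same list of messages and produce very different strings, which is why the template has to travel with the tokenizer rather than being hard-coded in application code.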

add_generation_prompt

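The issue "Chat Template Upgrades · Issue #26539 · huggingface/transformers" explains:

When you generate a response from a model, you want the prompt to contain the message history plus the tokens that indicate the start of a bot response. This ensures that the model actually replies to you instead of continuing the user's message or doing something similar.

In other situations, however, we do not want the template to do this. For example, when formatting messages for training, you do not want any extra generation prompt appended at the end.

The issue gives an example based on the standard ChatML template: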

<|im_start|>user
Hi, this is a user message.
<|im_start|>bot
Hi, this is a bot reply.
<|im_start|>user
Hi, this is the next user message.

However, if the user wants to generate a bot reply, the actual input they prompt with should end with <|im_start|>bot:

<|im_start|>user
Hi, this is a user message.
<|im_start|>bot
Hi, this is a bot reply.
<|im_start|>user
Hi, this is the next user message.
<|im_start|>bot

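The reason is that you want the bot to write a bot response, not to continue the user's message, emit some other special token, or do anything similarly odd. When you use apply_chat_template to format chat data for training, however, you do not want <|im_start|>bot appended at the end, because you are not going to generate more text.

Whether <|im_start|>bot is appended at the end is controlled by add_generation_prompt: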

>>> tokenizer.apply_chat_template(messages, tokenize=False)
"""<|im_start|>user
Hi, this is a user message.
<|im_start|>bot
Hi, this is a bot reply.
<|im_start|>user
Hi, this is the next user message."""

>>> tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
"""<|im_start|>user
Hi, this is a user message.
<|im_start|>bot
Hi, this is a bot reply.
<|im_start|>user
Hi, this is the next user message.
<|im_start|>bot"""

In summary: for generation (chat conversations), set add_generation_prompt=True; for training, set it to False.
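The effect of the flag can be sketched in plain Python. The function format_chatml_style below is a hypothetical stand-in for the simplified ChatML format used in the issue's example, not the library's implementation. It only shows that the generation prompt is a fixed suffix appended after the formatted history.

```python
# Hypothetical stand-in for the simplified ChatML format in the example above.
# Only the add_generation_prompt parameter name comes from the real API.
def format_chatml_style(messages, add_generation_prompt=False):
    turns = [f"<|im_start|>{m['role']}\n{m['content']}" for m in messages]
    text = "\n".join(turns)
    if add_generation_prompt:
        # open a new bot turn so the model writes the reply itself
        text += "\n<|im_start|>bot"
    return text


messages = [
    {"role": "user", "content": "Hi, this is a user message."},
    {"role": "bot", "content": "Hi, this is a bot reply."},
    {"role": "user", "content": "Hi, this is the next user message."},
]

training_text = format_chatml_style(messages)  # for training data
generation_text = format_chatml_style(messages, add_generation_prompt=True)  # for inference

print(training_text)
print("---")
print(generation_text)
```

The two strings differ only by the trailing bot marker, so the same message list can serve both training and inference with a single flag.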


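Author: Maeiee

Article link: Hugging Face Transformers Template

Copyright notice: Unless otherwise stated, this is an original article and the copyright belongs to Maeiee. Do not reproduce it without permission!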
